{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Downloading Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polyglot requires a model for each task and language.\n", "These models are essential for the library to function.\n", "Given the large size of some of the models, we distribute the models through a download manager separately. The download manager has several modes of operation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modes of Operation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Command Line Mode\n", "\n", "The subcommand `download` takes a package or more as an argument and download the specified packages in the `polyglot_data` directory." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: polyglot download [-h] [--dir DIR] [--quiet] [--force] [--exit-on-error] [--url SERVER_INDEX_URL] [packages [packages ...]]\r\n", "\r\n", "positional arguments:\r\n", " packages packages to be downloaded\r\n", "\r\n", "optional arguments:\r\n", " -h, --help show this help message and exit\r\n", " --dir DIR download package to directory DIR\r\n", " --quiet work quietly\r\n", " --force download even if already installed\r\n", " --exit-on-error exit if an error occurs\r\n", " --url SERVER_INDEX_URL\r\n", " download server index url\r\n" ] } ], "source": [ "!polyglot download --help" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[polyglot_data] Downloading package morph2.en to\r\n", "[polyglot_data] /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] Package morph2.en is already up-to-date!\r\n" ] } ], "source": [ "!polyglot download morph2.en" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Mode\n", "\n", "You can reach this mode by not 
supplying any arguments to the command line." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Polyglot Downloader\r\n", "---------------------------------------------------------------------------\r\n", " d) Download l) List u) Update c) Config h) Help q) Quit\r\n", "---------------------------------------------------------------------------\r\n", "Downloader> " ] } ], "source": [ "!polyglot download" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Library Interface" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.downloader import downloader\n", "downloader.download(\"embeddings2.en\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you may have noticed by now, we can install a specific model by specifying its task name and target language.\n", "\n", "The package name format is `task_name.language_code`.\n", "\n", "#### Language Collections\n", "\n", "Packages are grouped by language. 
For example, to download all the models that are specific to Arabic, use the Arabic collection, whose name is **LANG:** followed by the language code for Arabic, `ar`.\n", "\n", "Therefore, we can just run:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[polyglot_data] Downloading collection u'LANG:ar'\r\n", "[polyglot_data] | \r\n", "[polyglot_data] | Downloading package tsne2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package tsne2.ar is already up-to-date!\r\n", "[polyglot_data] | Downloading package transliteration2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package transliteration2.ar is already up-to-\r\n", "[polyglot_data] | date!\r\n", "[polyglot_data] | Downloading package morph2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package morph2.ar is already up-to-date!\r\n", "[polyglot_data] | Downloading package counts2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package counts2.ar is already up-to-date!\r\n", "[polyglot_data] | Downloading package sentiment2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package sentiment2.ar is already up-to-date!\r\n", "[polyglot_data] | Downloading package embeddings2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package embeddings2.ar is already up-to-date!\r\n", "[polyglot_data] | Downloading package ner2.ar to\r\n", "[polyglot_data] | /home/rmyeid/polyglot_data...\r\n", "[polyglot_data] | Package ner2.ar is already up-to-date!\r\n", "[polyglot_data] | \r\n", "[polyglot_data] Done downloading collection LANG:ar\r\n" ] } ], "source": [ "!polyglot download LANG:ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task Collections\n", "\n", 
"Packages are grouped by task. For example, if we want to download all the models that perform transliteration. The collection name is **TASK:** followed by the task name.\n", "\n", "Therefore, we can just run:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "downloader.download(\"TASK:transliteration2\", quiet=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Langauge & Task Support" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query our download manager for which tasks are supported by polyglot, as the following:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'embeddings2',\n", " u'counts2',\n", " u'pos2',\n", " u'ner2',\n", " u'sentiment2',\n", " u'morph2',\n", " u'tsne2']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "downloader.supported_tasks(lang=\"en\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query our download manager for which languages are supported by polyglot named entity recognition subsystem, as the following:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1. Polish 2. Turkish 3. Russian \n", " 4. Indonesian 5. Czech 6. Arabic \n", " 7. Korean 8. Catalan; Valencian 9. Italian \n", " 10. Thai 11. Romanian, Moldavian, ... 12. Tagalog \n", " 13. Danish 14. Finnish 15. German \n", " 16. Persian 17. Dutch 18. Chinese \n", " 19. French 20. Portuguese 21. Slovak \n", " 22. Hebrew (modern) 23. Malay 24. Slovene \n", " 25. Bulgarian 26. Hindi 27. Japanese \n", " 28. Hungarian 29. Croatian 30. Ukrainian \n", " 31. Serbian 32. Lithuanian 33. Norwegian \n", " 34. Latvian 35. 
Swedish 36. English \n", " 37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese \n", " 40. Estonian \n" ] } ], "source": [ "print(downloader.supported_languages_table(task=\"ner2\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can view all the available and/or installed collections or packages through the `list` function:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using default data directory (/home/rmyeid/polyglot_data)\n", "=========================================\n", " Data server index for \n", "=========================================\n", "Collections:\n", " [ ] LANG:af............. Afrikaans packages and models\n", " [ ] LANG:als............ als packages and models\n", " [ ] LANG:am............. Amharic packages and models\n", " [ ] LANG:an............. Aragonese packages and models\n", " [ ] LANG:ar............. Arabic packages and models\n", " [ ] LANG:arz............ arz packages and models\n", " [ ] LANG:as............. Assamese packages and models\n", " [ ] LANG:ast............ Asturian packages and models\n", " [ ] LANG:az............. Azerbaijani packages and models\n", " [ ] LANG:ba............. Bashkir packages and models\n", " [ ] LANG:bar............ bar packages and models\n", " [ ] LANG:be............. Belarusian packages and models\n", " [ ] LANG:bg............. Bulgarian packages and models\n", " [ ] LANG:bn............. Bengali packages and models\n", " [ ] LANG:bo............. Tibetan packages and models\n", " [ ] LANG:bpy............ bpy packages and models\n", " [ ] LANG:br............. Breton packages and models\n", " [ ] LANG:bs............. Bosnian packages and models\n", " [ ] LANG:ca............. Catalan packages and models\n", " [ ] LANG:ce............. Chechen packages and models\n", " [ ] LANG:ceb............ Cebuano packages and models\n", " [ ] LANG:cs............. 
Czech packages and models\n", " [ ] LANG:cv............. Chuvash packages and models\n", " [ ] LANG:cy............. Welsh packages and models\n", " [ ] LANG:da............. Danish packages and models\n", " [ ] LANG:de............. German packages and models\n", " [ ] LANG:diq............ diq packages and models\n", " [ ] LANG:dv............. Divehi packages and models\n", " [ ] LANG:el............. Greek packages and models\n", " [P] LANG:en............. English packages and models\n", " [ ] LANG:eo............. Esperanto packages and models\n", " [ ] LANG:es............. Spanish packages and models\n", " [ ] LANG:et............. Estonian packages and models\n", " [ ] LANG:eu............. Basque packages and models\n", " [ ] LANG:fa............. Persian packages and models\n", " [ ] LANG:fi............. Finnish packages and models\n", " [ ] LANG:fo............. Faroese packages and models\n", " [ ] LANG:fr............. French packages and models\n", " [ ] LANG:fy............. Western Frisian packages and models\n", " [ ] LANG:ga............. Irish packages and models\n", " [ ] LANG:gan............ gan packages and models\n", " [ ] LANG:gd............. Scottish Gaelic packages and models\n", " [ ] LANG:gl............. Galician packages and models\n", " [ ] LANG:gu............. Gujarati packages and models\n", " [ ] LANG:gv............. Manx packages and models\n", " [ ] LANG:he............. Hebrew packages and models\n", " [ ] LANG:hi............. Hindi packages and models\n", " [ ] LANG:hif............ hif packages and models\n", " [ ] LANG:hr............. Croatian packages and models\n", " [ ] LANG:hsb............ Upper Sorbian packages and models\n", " [ ] LANG:ht............. Haitian packages and models\n", " [ ] LANG:hu............. Hungarian packages and models\n", " [ ] LANG:hy............. Armenian packages and models\n", " [ ] LANG:ia............. Interlingua packages and models\n", " [ ] LANG:id............. 
Indonesian packages and models\n", " [ ] LANG:ilo............ Iloko packages and models\n", " [ ] LANG:io............. Ido packages and models\n", " [ ] LANG:is............. Icelandic packages and models\n", " [ ] LANG:it............. Italian packages and models\n", " [ ] LANG:ja............. Japanese packages and models\n", " [ ] LANG:jv............. Javanese packages and models\n", " [ ] LANG:ka............. Georgian packages and models\n", " [ ] LANG:kk............. Kazakh packages and models\n", " [ ] LANG:km............. Khmer packages and models\n", " [ ] LANG:kn............. Kannada packages and models\n", " [ ] LANG:ko............. Korean packages and models\n", " [ ] LANG:ku............. Kurdish packages and models\n", " [ ] LANG:ky............. Kyrgyz packages and models\n", " [ ] LANG:la............. Latin packages and models\n", " [ ] LANG:lb............. Luxembourgish packages and models\n", " [ ] LANG:li............. Limburgish packages and models\n", " [ ] LANG:lmo............ lmo packages and models\n", " [ ] LANG:lt............. Lithuanian packages and models\n", " [ ] LANG:lv............. Latvian packages and models\n", " [ ] LANG:mg............. Malagasy packages and models\n", " [ ] LANG:mk............. Macedonian packages and models\n", " [ ] LANG:ml............. Malayalam packages and models\n", " [ ] LANG:mn............. Mongolian packages and models\n", " [ ] LANG:mr............. Marathi packages and models\n", " [ ] LANG:ms............. Malay packages and models\n", " [ ] LANG:mt............. Maltese packages and models\n", " [ ] LANG:my............. Burmese packages and models\n", " [ ] LANG:ne............. Nepali packages and models\n", " [ ] LANG:nl............. Dutch packages and models\n", " [ ] LANG:nn............. Norwegian Nynorsk packages and models\n", " [ ] LANG:no............. Norwegian packages and models\n", " [ ] LANG:oc............. Occitan packages and models\n", " [ ] LANG:or............. 
Oriya packages and models\n", " [ ] LANG:os............. Ossetic packages and models\n", " [ ] LANG:pa............. Punjabi packages and models\n", " [ ] LANG:pam............ Pampanga packages and models\n", " [ ] LANG:pl............. Polish packages and models\n", " [ ] LANG:pms............ pms packages and models\n", " [ ] LANG:ps............. Pashto packages and models\n", " [ ] LANG:pt............. Portuguese packages and models\n", " [ ] LANG:qu............. Quechua packages and models\n", " [ ] LANG:rm............. Romansh packages and models\n", " [ ] LANG:ro............. Romanian packages and models\n", " [ ] LANG:ru............. Russian packages and models\n", " [ ] LANG:sa............. Sanskrit packages and models\n", " [ ] LANG:sah............ Sakha packages and models\n", " [ ] LANG:scn............ Sicilian packages and models\n", " [ ] LANG:sco............ Scots packages and models\n", " [ ] LANG:se............. Northern Sami packages and models\n", " [ ] LANG:sh............. Serbo-Croatian packages and models\n", " [ ] LANG:si............. Sinhala packages and models\n", " [ ] LANG:sk............. Slovak packages and models\n", " [ ] LANG:sl............. Slovenian packages and models\n", " [ ] LANG:sq............. Albanian packages and models\n", " [ ] LANG:sr............. Serbian packages and models\n", " [ ] LANG:su............. Sundanese packages and models\n", " [ ] LANG:sv............. Swedish packages and models\n", " [ ] LANG:sw............. Swahili packages and models\n", " [ ] LANG:szl............ szl packages and models\n", " [ ] LANG:ta............. Tamil packages and models\n", " [ ] LANG:te............. Telugu packages and models\n", " [ ] LANG:tg............. Tajik packages and models\n", " [ ] LANG:th............. Thai packages and models\n", " [ ] LANG:tk............. Turkmen packages and models\n", " [ ] LANG:tl............. Tagalog packages and models\n", " [ ] LANG:tr............. 
Turkish packages and models\n", " [ ] LANG:tt............. Tatar packages and models\n", " [ ] LANG:ug............. Uyghur packages and models\n", " [ ] LANG:uk............. Ukrainian packages and models\n", " [ ] LANG:ur............. Urdu packages and models\n", " [ ] LANG:uz............. Uzbek packages and models\n", " [ ] LANG:vec............ vec packages and models\n", " [ ] LANG:vi............. Vietnamese packages and models\n", " [ ] LANG:vls............ vls packages and models\n", " [ ] LANG:vo............. Volapük packages and models\n", " [ ] LANG:wa............. Walloon packages and models\n", " [ ] LANG:war............ Waray packages and models\n", " [ ] LANG:yi............. Yiddish packages and models\n", " [ ] LANG:yo............. Yoruba packages and models\n", " [ ] LANG:zh............. Chinese packages and models\n", " [ ] LANG:zhc............ Chinese Character packages and models\n", " [ ] LANG:zhw............ zhw packages and models\n", " [ ] TASK:counts2........ counts2\n", " [ ] TASK:embeddings2.... embeddings2\n", " [ ] TASK:ner2........... ner2\n", " [P] TASK:sentiment2..... sentiment2\n", " [ ] TASK:tsne2.......... tsne2\n", "\n", "([*] marks installed packages; [P] marks partially installed collections)\n" ] } ], "source": [ "downloader.list(show_packages=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }